Computer Science Technical Report: Approximating a Policy Can be Easier Than Approximating a Value Function

Author

  • Charles W. Anderson
Abstract

Value functions can speed the learning of a solution to Markov Decision Problems by providing a prediction of reinforcement against which received reinforcement is compared. Once the learned values correctly reflect the relative ordering of actions, further learning is not necessary. In fact, further learning can disrupt the optimal policy if the value function is implemented with a function approximator of limited complexity. This is illustrated here by comparing Q-learning (Watkins, 1989) and a policy-only algorithm (Baxter & Bartlett, 1999), both using a simple neural network as the function approximator. A Markov Decision Problem is shown for which Q-learning oscillates between the optimal policy and a sub-optimal one, while the direct-policy algorithm converges to the optimal policy.
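To make the contrast concrete, below is a minimal Python sketch of the two styles of update on a hypothetical two-state, two-action MDP. The problem, the one-hot linear approximator standing in for the small network, and the step sizes are illustrative assumptions, not the setup from the report: the first routine applies semi-gradient Q-learning updates, and the second adjusts the policy parameters directly using an eligibility trace of log-policy gradients, in the spirit of Baxter and Bartlett's direct-policy method.

import numpy as np

# Hypothetical two-state, two-action MDP used only for illustration;
# it is not the problem studied in the report.
N_STATES, N_ACTIONS = 2, 2
R = np.array([[0.0, 1.0],      # reward for (state, action)
              [0.5, 0.0]])
P = np.array([[1, 0],          # deterministic successor state for (state, action)
              [0, 1]])
GAMMA = 0.9

def features(s, a):
    # One-hot state-action features; a stand-in for a small network.
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_learning(steps=5000, alpha=0.1, eps=0.1, seed=0):
    # Q-learning with a linear approximator over the one-hot features.
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    s = 0
    for _ in range(steps):
        q = np.array([w @ features(s, a) for a in range(N_ACTIONS)])
        a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(q.argmax())
        r, s_next = R[s, a], int(P[s, a])
        q_next = max(w @ features(s_next, b) for b in range(N_ACTIONS))
        w += alpha * (r + GAMMA * q_next - q[a]) * features(s, a)  # TD update
        s = s_next
    return w

def policy_only(steps=5000, alpha=0.05, beta=0.9, seed=0):
    # Softmax policy with an eligibility trace of log-policy gradients,
    # in the spirit of Baxter & Bartlett's direct-policy algorithm.
    rng = np.random.default_rng(seed)
    theta = np.zeros(N_STATES * N_ACTIONS)
    z = np.zeros_like(theta)
    s = 0
    for _ in range(steps):
        prefs = np.array([theta @ features(s, a) for a in range(N_ACTIONS)])
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        a = int(rng.choice(N_ACTIONS, p=probs))
        r, s_next = R[s, a], int(P[s, a])
        grad_log = features(s, a) - probs @ np.array(
            [features(s, b) for b in range(N_ACTIONS)])
        z = beta * z + grad_log        # discounted trace of score functions
        theta += alpha * r * z         # adjust the policy parameters directly
        s = s_next
    return theta

print("Q-learning weights:", q_learning())
print("policy parameters: ", policy_only())

The abstract's point, restated against this sketch: the Q-learning weights must keep approximating values accurately even after the greedy policy is already optimal, which a limited approximator may not sustain, whereas the policy-only parameters only need to keep the optimal action most probable.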


Similar resources

Approximating the step change point of the process fraction non conforming using genetic algorithm to optimize the likelihood function

Control charts are standard statistical process control (SPC) tools for detecting assignable causes. These charts trigger a signal when a process gets out of control but they do not indicate when the process change has begun. Identifying the real time of the change in the process, called the change point, is very important for eliminating the source(s) of the change. Knowing when a process has ...
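As a concrete illustration of the change-point idea, here is a small Python sketch that estimates the step-change point of a fraction nonconforming by maximizing a binomial likelihood. The data, sample size, and the exhaustive search over candidate change points are illustrative assumptions; the paper itself optimizes the likelihood with a genetic algorithm.

import numpy as np

def estimate_change_point(defects, n, p0):
    # MLE of the step-change point tau for a fraction nonconforming.
    # defects[i] counts nonconforming items in sample i of size n; the
    # process runs at the in-control rate p0 up to tau and at an unknown
    # rate p1 afterwards.  Exhaustive search over tau is used here purely
    # for illustration, in place of the paper's genetic algorithm.
    T = len(defects)
    d = np.asarray(defects, dtype=float)

    def loglik(p, x):
        # Binomial log-likelihood, dropping the constant combinatorial term.
        p = float(np.clip(p, 1e-9, 1 - 1e-9))
        return float(np.sum(x * np.log(p) + (n - x) * np.log(1 - p)))

    best_tau, best_ll = 0, -np.inf
    for tau in range(1, T):                       # samples before tau are in control
        p1_hat = d[tau:].sum() / (n * (T - tau))  # MLE of the shifted rate
        ll = loglik(p0, d[:tau]) + loglik(p1_hat, d[tau:])
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau

# Hypothetical data: the rate shifts from 2% to 8% after sample 30.
rng = np.random.default_rng(1)
data = np.concatenate([rng.binomial(100, 0.02, 30), rng.binomial(100, 0.08, 20)])
print("estimated change point:", estimate_change_point(data, n=100, p0=0.02))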


The Structure of Bhattacharyya Matrix in Natural Exponential Family and Its Role in Approximating the Variance of a Statistics

In most situations the best estimator of a function of the parameter exists, but sometimes it has a complex form and we cannot compute its variance explicitly. Therefore, a lower bound for the variance of an estimator is one of the fundamentals of estimation theory, because it gives us an idea about the accuracy of an estimator. It is well-known in statistical inference that the Cramér...


A Multiprocessor System with Non-Preemptive Earliest-Deadline-First Scheduling Policy: A Performability Study

This paper introduces an analytical method for approximating the performability of a firm real-time system modeled by a multi-server queue. The service discipline in the queue is earliest-deadline-first (EDF), which is an optimal scheduling algorithm. Real-time jobs with exponentially distributed relative deadlines arrive according to a Poisson process. All jobs have deadlines until the end of s...
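For intuition about the model being analyzed, the following is a small Monte Carlo sketch in Python (not the paper's analytical method) of a multi-server queue under non-preemptive EDF with Poisson arrivals, exponential service times, and exponential relative deadlines. The parameter values are arbitrary, and for simplicity jobs that miss their deadline are still served rather than discarded, which a firm real-time system would typically not do.

import heapq
import random

def edf_miss_fraction(lam=1.8, mu=1.0, dl_rate=0.5, servers=2,
                      n_jobs=50000, seed=0):
    # Monte Carlo sketch of an M/M/m queue under non-preemptive EDF.
    # Jobs arrive in a Poisson process (rate lam), require exponential
    # service (rate mu), and carry exponential relative deadlines
    # (rate dl_rate) that apply until the end of service.  Returns the
    # fraction of jobs finishing after their deadline.
    rng = random.Random(seed)

    # Pre-generate arrival times and absolute deadlines.
    t, jobs = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(lam)
        jobs.append((t, t + rng.expovariate(dl_rate)))

    busy = []      # heap of (completion_time, deadline) for running jobs
    ready = []     # heap of (deadline, arrival_time) for waiting jobs
    missed = 0

    def dispatch(now):
        # Start waiting jobs on free servers, earliest deadline first.
        nonlocal missed
        while ready and len(busy) < servers:
            deadline, _ = heapq.heappop(ready)
            finish = now + rng.expovariate(mu)
            if finish > deadline:
                missed += 1            # missed jobs are still served here
            heapq.heappush(busy, (finish, deadline))

    for arrival, deadline in jobs:
        # Retire every job that completes before this arrival,
        # dispatching queued work at each completion instant.
        while busy and busy[0][0] <= arrival:
            done, _ = heapq.heappop(busy)
            dispatch(done)
        heapq.heappush(ready, (deadline, arrival))
        dispatch(arrival)

    # Drain the jobs remaining after the last arrival.
    while busy:
        done, _ = heapq.heappop(busy)
        dispatch(done)

    return missed / n_jobs

print("fraction of missed deadlines:", edf_miss_fraction())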


Error bounds in approximating n-time differentiable functions of self-adjoint operators in Hilbert spaces via a Taylor's type expansion

On utilizing the spectral representation of self-adjoint operators in Hilbert spaces, some error bounds in approximating $n$-time differentiable functions of self-adjoint operators in Hilbert spaces via a Taylor's type expansion are given.




Year of publication: 2000